ABSTRACT

Building a visual summary from an egocentric photostream
captured by a lifelogging wearable camera is of high interest
for different applications (e.g. memory reinforcement). In
this paper, we propose a new summarization method based
on keyframe selection that uses visual features extracted by
means of a convolutional neural network. Our method applies
unsupervised clustering to divide the photostream into
events, and finally extracts the most relevant keyframe for
each event. We assess the results with a blind-taste test in
which a group of 20 people rated the quality of the
summaries.

Index Terms — egocentric, lifelogging, summarization, 
keyframes 

1. INTRODUCTION 

Lifelogging devices offer the possibility to record a rich set of
data about the daily life of a person. A good example of this
are wearable cameras, which are able to capture images from an
egocentric point of view, continuously and over long periods
of time. The acquired set of images comes in two formats
depending on the device used: 1) high-temporal resolution
videos, which usually produce more than 30 fps and capture
a lot of dynamic information, but can only store a few hours
of data, or 2) low-temporal resolution photostreams, which
usually produce only 1 or 2 fpm, but are able to capture events
that happen during a whole day (having around 16 hours of
autonomy).

Being able to automatically analyze and understand the
large amount of visual information provided by these devices
would be very useful for developing a wide range of
applications. Some examples could be building a nutrition diary
based on what, where and in which conditions the user eats,
to keep track of any possible unhealthy habit, or providing
an automatic summary of the whole day of the user as
a memory aid for mild cognitive impairment (MCI)
patients by reactivating their memory capabilities [?].

Fig. 1. Scheme of the proposed visual summarization.

In our work, we focus on the automatic extraction of a good
summary that can be used as a memory aid for MCI patients.
Usually, these patients suffer neuron degradation that causes
them problems in recognizing familiar people, objects
and places [?]. Hence, the automatically extracted visual
summary should be clear and informative enough to recall the
daily activity at a glance.

With this ultimate goal in mind, we propose an approach
that starts by extracting a set of features for frame
characterization by means of a convolutional neural network.
These visual descriptors are used to segment events by running
an agglomerative clustering, which is post-processed to
guarantee temporal coherence (similar to [?]).
Finally, a representative keyframe for each event is selected
using the Random Walk [?] or Minimum Distance [?]
algorithms. The overall scheme is depicted in Figure 1.

This paper is structured as follows. Section 2 overviews
previous work on event segmentation and summarization in
the field of egocentric video. Our approach is described in
Section 3 and its quantitative and qualitative evaluation in
Section 4. Finally, Section 5 draws the final conclusions and
outlines our future work.


2. RELATED WORK 

The two main problems addressed in this paper, event
segmentation and summarization, have been studied in related
work on egocentric data, as presented in this section.


2.1. Egocentric event segmentation 

Most existing techniques agree that the first step in
constructing a summary is a shot- or event-based segmentation of
the photostream or video. Lu and Grauman [?] and Bolaños
et al. [?] both propose event segmentation that relies on
motion information, colour and blurriness, integrated in an
energy-minimization technique. The result is a final event
segmentation that is able to capture the different motion-related
events that the user experiences. In the former approach [?],
the authors use high-temporal videos and an optical flow
descriptor for characterizing the neighbouring frames. In the
latter one, which works with low-temporal data, a
SIFT-flow descriptor is used instead, as it is more robust for
capturing long-term motion relationships. Poleg et al. [?] also
propose motion-based segmentation, but they use a new method of
Cumulative Displacement Curves for describing the motion
between neighbouring video frames. The proposed solution is
able to focus on the forward user movement and removes the
noise of the head motion produced by head-mounted wearable
cameras. Other methods have been proposed using low-level
sensor features, like the work in [9], which splits low-temporal
resolution lifelogs into events. Lin and Hauptmann [?] also
propose a simple approach based on using colour features in a
Time-Constrained K-Means clustering algorithm for keeping
temporal coherence. In [?], Talavera et al. design a
segmentation framework also based on energy minimization.
In this case, the authors offer the possibility to
integrate different clustering and segmentation methods,
offering more robust results.


2.2. Egocentric summarization 

Focusing on the summarization of lifelog data after event
segmentation, there are two basic research directions, both of
them aiming at removing data that are redundant or
uninformative. In the case of video recordings, it is common
practice to select a subset of video segments to create
a video summary. On the other hand, when working with
devices that take single pictures at a low frame rate, the
problem is usually tackled by selecting the most representative
keyframes. The most relevant work in the literature following
the video approach is from Grauman et al. in [6, 11], where
a summary methodology for egocentric video sequences is
proposed. The authors rely on an initial event segmentation,
followed by the detection of salient objects and people,
create a graph linking events and the important objects/people,
and finish with a selection of a subset of the events of
interest. This final selection is based on combining three different
measures: 1) Story (choosing a set of shots that are able to
follow the inherent story in the dataset), 2) Importance (aimed at
choosing only shots that show some important aspect of the
day) and 3) Diversity (adding a way to avoid repeating
similar actions or events in the summary). When considering the
keyframe selection approach, one of the most relevant works
is by Doherty et al. [?], where the authors study various
selection methods such as: 1) getting the frame in the middle of
each segment, 2) getting the frame that is the most similar to
the rest of the frames in the event, or 3) selecting the closest
frame to the event average.

3. METHODOLOGY 

This section presents our methodology for keyframe-based
summarization of egocentric photostreams, depicted in
Figure 1. We start by characterizing each of the lifelog frames
with a global scale visual descriptor. These features are used
to create a visual-based event segmentation, which
incorporates a post-processing step to guarantee time consistency.
Finally, the most visually repetitive frame is selected as the most
representative of the event.

3.1. Frames characterization 

Convolutional Neural Networks (convnets or CNNs) have
recently outperformed hand-crafted features in several
computer vision tasks [?]. These networks have the ability
to learn sets of features optimised for a pattern recognition
problem described by a large amount of visual data.

The last layer of these convnets is typically a soft-max
classifier, which in some works is ignored, and the
penultimate fully connected layers are directly used as feature
vectors. These visual features have been successfully used like
any other traditional hand-crafted features for purposes such
as image retrieval [?] or classification [?].

In the field of egocentric video segmentation, convnets
have also proved suitable for clustering purposes [?].
For this reason, we used a set of features extracted by means
of the pre-trained CaffeNet convnet included in the Caffe
library [?]. This convnet was inspired by [?] and trained on
ImageNet [?]. In our case, we used as features the output of
the penultimate layer, a fully connected layer of 4,096
components, thus discarding the final soft-max layer, which
was intended to classify 1,000 different semantic classes from
ImageNet.

3.2. Events segmentation 

The egocentric photostream is segmented with an unsupervised
hierarchical Agglomerative Clustering (AC) [?] based
on the convnet visual features. As shown in [?], this
clustering methodology reaches a reasonable accuracy for this task.
In this way, we can define sets of images, each of them
representing a different event. AC algorithms can be applied with
different similarity measures. Different configurations were
tested (see details in Section 4.2) and the best results were
obtained with the average linkage method with Euclidean
distance. This option determines the two most similar clusters to
be fused in each iteration using the following distance:

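$$\arg\min_{C_i, C_j \in C_t} D(C_i, C_j), \text{ where} \qquad (1)$$

$$D(C_i, C_j) = \frac{1}{|C_i| \times |C_j|} \sum_{s_{k,i} \in C_i} \sum_{s_{l,j} \in C_j} \left\| f(s_{k,i}) - f(s_{l,j}) \right\|_2 \qquad (2)$$

where $C_t$ is the set of clusters at iteration $t$, $s_{k,i}$ and $s_{l,j}$ are
the samples in cluster $C_i$ and $C_j$, respectively, and $f(s)$ are
the visual features extracted by means of the convnet.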
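
As an illustration of the average-linkage criterion with Euclidean distance described above, the following is a minimal pure-Python sketch and not the authors' implementation. The function names and the toy 2-D features are hypothetical, and the `cutoff` argument is assumed here to be a plain threshold on the linkage distance, which may not correspond to the cut-off parameter reported in Section 4.2.

```python
import math
from itertools import combinations


def average_linkage(ci, cj, features):
    """Mean pairwise Euclidean distance between the frames of two clusters."""
    total = sum(math.dist(features[k], features[l]) for k in ci for l in cj)
    return total / (len(ci) * len(cj))


def agglomerative_clustering(features, cutoff):
    """Fuse the two closest clusters until the smallest distance exceeds cutoff.

    Assumption: cutoff is a threshold on the average-linkage distance itself.
    Returns one cluster label per frame.
    """
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > 1:
        # Find the pair of clusters (a < b) with minimal average-linkage distance.
        (a, b), dist = min(
            (((a, b), average_linkage(clusters[a], clusters[b], features))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > cutoff:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy example with 2-D points standing in for the 4,096-D convnet features.
toy_features = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_labels = agglomerative_clustering(toy_features, cutoff=2.0)
```

This naive version recomputes all pairwise cluster distances at every iteration, which is acceptable for a sketch but would be slow on a full-day photostream.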

However, creating the clusters based only on visual
features often generates solutions that are inconsistent from a
temporal perspective. Typically, images captured in the same
scenario will be visually clustered as a single event despite
corresponding to separate moments. For example, frames from the
beginning of the day (e.g. when the user takes the train to
commute to work) may be visually indistinguishable from
other frames from the end of the day (e.g. when the user
goes back home, also by train). Another common
problem when relying only on visual features is that very
small clusters are sometimes generated, a result which
should be avoided because an event is typically required to
have a certain span in time (e.g. 3 minutes, in our work).

In order to solve these problems, we introduce two
post-processing steps for refining the resulting clusters: Division
and Fusion. The Division step splits into different events those
images in the same cluster which are temporally interrupted
by events defined in other clusters. For example, the event
in orange from Figure 2 a) is divided into two events (orange
and yellow) in Figure 2 c) due to a Corridor scene event (in
green) interrupting the original Office scene. On the other
hand, the second post-processing step, Fusion, merges all
those events shorter than a threshold with the closest
neighboring event in time.
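
The two post-processing steps can be sketched as follows. This is a simplified illustration under stated assumptions rather than the published implementation: events are measured in number of frames instead of minutes, and a short event is absorbed by the longer of its two temporal neighbours (the text only says "the closest neighboring event in time"). All function names are hypothetical.

```python
from itertools import groupby


def division(labels):
    """Split clusters into temporally contiguous runs.

    labels: cluster label of each frame, in temporal order.
    Returns the length (in frames) of each contiguous event.
    """
    return [len(list(group)) for _, group in groupby(labels)]


def fusion(run_lengths, min_len):
    """Absorb events shorter than min_len into an adjacent event.

    Assumption: the shortest event is merged into its longer neighbour.
    """
    runs = list(run_lengths)
    while len(runs) > 1:
        shortest = min(range(len(runs)), key=runs.__getitem__)
        if runs[shortest] >= min_len:
            break
        left = runs[shortest - 1] if shortest > 0 else -1
        right = runs[shortest + 1] if shortest + 1 < len(runs) else -1
        target = shortest - 1 if left >= right else shortest + 1
        runs[target] += runs[shortest]
        del runs[shortest]
    return runs


def to_event_ids(run_lengths):
    """Expand event lengths back into one event id per frame."""
    return [event for event, n in enumerate(run_lengths) for _ in range(n)]


# Cluster 0 (e.g. Office) is interrupted by a single frame of cluster 1.
frame_labels = [0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 2, 2]
divided = division(frame_labels)        # four contiguous events
fused = fusion(divided, min_len=3)      # the one-frame event is absorbed
event_ids = to_event_ids(fused)
```

Note that this sketch does not re-merge two events of the same original cluster that become adjacent after Fusion; whether that should happen is not specified in the text.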


3.3. Keyframe selection 

Once the photostream is split into events, the next step is
to carefully select a good subset of keyframes. To do so, we
explored two different methods: random walk and a minimum
distance approach. Both approaches are based on the
assumption that the best photo to represent the event is the one that
is the most visually similar to the rest of the photos in the
same cluster. As a result, each event can be automatically
represented by a single image and, when all these images are
combined, they provide a visual summary of the user's day.

3.3.1. Random Walk 

We propose to use the Random Walk algorithm [?] in each of
the events, separately. As a result, the algorithm will select
the photo that is the most visually similar to the rest of the
photos in the event. After applying the same procedure to all
the events, we obtain a good general representation of the
main events that happened in the user's daily life.

The Random Walk algorithm works as follows: 1) the
visual similarity for each pair of photos in the event is
computed; 2) a graph described by a transition probability
matrix is built using the extracted similarities as weights on each
of the edges; 3) the matrix eigenvectors are obtained, and 4)
the image associated with the largest value in the first
eigenvector is considered as the keyframe of the event.
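
The four steps above can be sketched in a few lines. This is a minimal sketch under several assumptions that the text does not fix: the similarity kernel `exp(-distance / scale)`, the `scale` parameter, and the use of power iteration to approximate the dominant (left) eigenvector of the transition matrix are all illustrative choices, and the function name is hypothetical.

```python
import math


def random_walk_keyframe(features, scale=1.0, iterations=200):
    """Return the index of the frame with the largest stationary probability."""
    n = len(features)
    if n == 1:
        return 0
    # 1) Pairwise visual similarity (assumed kernel: exp(-distance / scale)).
    sim = [[0.0 if i == j
            else math.exp(-math.dist(features[i], features[j]) / scale)
            for j in range(n)] for i in range(n)]
    # 2) Row-normalise the similarities into transition probabilities.
    trans = []
    for row in sim:
        total = sum(row)
        trans.append([w / total for w in row] if total > 0 else [1.0 / n] * n)
    # 3) Power iteration approximates the dominant left eigenvector,
    #    i.e. the stationary distribution of the random walk.
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * trans[i][j] for i in range(n)) for j in range(n)]
    # 4) The frame with the largest component is taken as the keyframe.
    return max(range(n), key=pi.__getitem__)


# Toy 1-D descriptors: three similar frames and one outlier.
rw_keyframe = random_walk_keyframe([(0.0,), (1.0,), (2.0,), (10.0,)])
```

With high-dimensional descriptors the distances are large, so `scale` would need to be chosen so that the exponential does not underflow; a value on the order of the mean pairwise distance is one plausible choice.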

3.3.2. Minimum distance 

The second option considered selects the individual frame
with the minimal accumulated distance with respect to all the
other images in the same event. That is, let us consider the
adjacency matrix $A = \{a_{i,j}\} = \{d_{s_i,s_j}\}$, where $d_{s_i,s_j}$ is the
Euclidean distance between the descriptors of images $s_i$ and
$s_j$ extracted by the convnet, $i = 1, \ldots, N$, $j = 1, \ldots, N$, where
$N$ is the number of frames of the event. Let us consider the
vector $v = \{v_i\} = \{\sum_{j=1}^{N} a_{i,j}\}$ of accumulated distances. One can
easily see that the index of the minimal component of vector $v$,
i.e. $k = \arg\min_i \{v_i\}$, $i = 1, \ldots, N$, determines the closest
frame to the rest of frames in the corresponding event with
respect to the $L_1$ norm [?].
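
A minimal sketch of this selection rule, assuming the descriptors are given as plain Python sequences of floats (the function name and the toy data are hypothetical):

```python
import math


def minimum_distance_keyframe(features):
    """Return the index k of the frame minimising the accumulated distance."""
    # v[i] is the sum of Euclidean distances from frame i to every frame
    # of the event (the distance of a frame to itself is zero).
    v = [sum(math.dist(fi, fj) for fj in features) for fi in features]
    return min(range(len(v)), key=v.__getitem__)


# Toy 1-D descriptors: the middle frame of the tight group is selected.
md_keyframe = minimum_distance_keyframe([(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)])
```

In case of ties, this sketch returns the earliest frame in temporal order; the text does not specify a tie-breaking rule.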

4. RESULTS

This section presents the quantitative and qualitative
experiments run on a home-made egocentric dataset to assess the
performance of the presented technique.

4.1. Dataset 

Our experiments were performed on a home-made dataset of
images acquired with a Narrative¹ wearable camera. This
device is typically clipped on the user's clothes under the neck
or around the chest area. The dataset we used is a subset of
the one used by the authors in [10] (not using the SenseCam
sets). It is composed of 5 one-day lifelogs from 3 different persons
and has a total of 4,005 images. Furthermore, it includes the
ground truth (GT) event segmentation for assessing the
clustering results.

¹ www.getnarrative.com

Fig. 2. Example of the events labeling produced by a) simply using the AC algorithm, b) applying the division strategy and c)
additionally applying the fusion strategy. Each color represents a different event.


4.2. Quantitative evaluation of event segmentation 

The first test assessed the quality of the photostream
segmentation into events. For this evaluation, we used
the Jaccard Index, which measures the overlap between
each of the resulting events and the GT in the following way:

$$J(E, GT) = \sum_{e_i \in E} \sum_{g_j \in GT} M_{ij} \frac{|e_i \cap g_j|}{|e_i \cup g_j|} \qquad (3)$$

where $E$ is the resulting set of events, $GT$ is the ground truth,
$e_i$ and $g_j$ are a single event and a single GT segment
respectively, and $M_{ij}$ is an indicator matrix with value 1 iff $e_i$ has
the highest match with $g_j$.
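
The overlap measure can be illustrated with a short sketch. It assumes that events and GT segments are represented as sets of frame indices and that each detected event is matched to the GT segment with which it has the highest Jaccard overlap; the optional division by the number of events is an assumption added here to obtain a value in [0, 1], and the function names are hypothetical.

```python
def jaccard(a, b):
    """Intersection over union of two sets of frame indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_index(events, gt_segments, normalize=False):
    """Sum, over detected events, of the overlap with the best-matching GT segment.

    Assumption: normalize=True divides by the number of detected events.
    """
    total = sum(max(jaccard(e, g) for g in gt_segments) for e in events)
    return total / len(events) if normalize else total


# Two detected events compared against two GT segments over six frames.
detected = [{0, 1, 2}, {3, 4, 5}]
ground_truth = [{0, 1}, {2, 3, 4, 5}]
score = jaccard_index(detected, ground_truth)                    # 2/3 + 3/4
mean_score = jaccard_index(detected, ground_truth, normalize=True)
```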

We compared different cluster distance methods with
respect to the chosen cut-off parameter (which determines how
many clusters are formed considering their distance value)
for the AC (see Figure 3). We chose the "average" method with
cutoff = 1.154 as the best option and, with this
configuration, we measured the gain of introducing the Division-Fusion
strategy, illustrated in Figure 4.


4.3. Qualitative evaluation with blind-taste test

The assessment of visual summaries of a day, like the one
shown in Figure 5, is a challenging problem, because there is
not a single solution for it. Different summaries of the same
day may be considered equally satisfactory due to
near-duplicate images and subjectivity in the judgments. Therefore, we
followed an evaluation procedure similar to the one adopted
by Lu and Grauman [?]. We designed a blind-taste test and
asked a group of 20 people to rate the output of different
solutions, without knowing which of them corresponded to
each configuration.





Fig. 3. Average Jaccard index value obtained for the 5 sets.
We compare each of the methods after applying the
division-fusion strategy with respect to the best cut-off AC values.

4.3.1. Keyframe selection 

The first qualitative evaluation focused on the keyframe
selection strategy, comparing both presented algorithms (Random
Walk and Minimum Distance) with a third one, Random
Baseline. In this first part, the three selection strategies were
applied on each of the events defined by the GT annotation.

In this first part, we showed the user a complete
event according to the GT labels and, afterwards, the three
keyframes selected by the three methods under comparison in
a random order². Then, the user had to answer whether each of
the candidates was representative of the current event (results
in Figure 6), and also choose which of them was the best one
(results in Figure 7). This procedure was applied on each of
the events of the day and the results were averaged per day.

The scoring results presented in Figures 6 and 7 indicate that
both proposed solutions consistently outperform the random
baseline for each day. The difference is more remarkable
when we asked the user to choose only one of the
possibilities (Figure 7). We must note that the result was usually
very similar for both the Random Walk and the Minimum
Distance, since in most of the cases both algorithms
selected the same keyframe.

² If any of the results for the different methods was repeated, only one
image was shown and the results were counted for both methods.
Fig. 4. Effect when using (dark blue) the division-fusion (DF)
strategy and when not using it (light blue) in the average
Jaccard index result for all the sets.



Fig. 5. Example of one of the summaries obtained by applying
our approach on a dataset captured with a Narrative camera.
Fig. 6. Results answering "yes" to the question "Is this image
representative for the current event?"

Fig. 7. Results to the question "Which of the previous frames
is the most representative for the event?"

4.3.2. Visual daily summary

In the second part of our qualitative study, we assessed the
whole daily summary, built with the automatic event
segmentation and the different solutions for keyframe selection. In
this experiment, we added a fourth configuration that built a
visual summary with a temporal Uniform Sampling of the day
photostream, in such a way that the total number of frames
was the same as the number of events detected through AC.

This time the user was shown the four summaries of the
day generated by the four configurations. Figure 5 provides
an example built with the Random Walk solution. For each
summary, the user was first asked whether the set could
represent the day (results in Figure 8), and then which of the four
best described the day (results in Figure 9).

Focusing on the average results in Figure 8, we can state
that, applying either Random Walk (88%) or Minimum
Distance (86%), most of the generated summaries were
positively assessed by the graders. Moreover, when it comes to
choosing only the best summary, our method gathered 58% of
the total votes if we consider that the voting is exclusive and
that the summaries produced by the Random Walk and the
Minimum Distance methods are very similar. As a result, we
obtained improvements of 34% and 41% with respect to the
Random and the Uniform baselines, respectively.


5. CONCLUSIONS 

In this work, we presented a new methodology to extract a
keyframe-based summary from egocentric photostreams.
After the qualitative validation carried out by 20 different users, we
can state that our method achieves very good and
representative summaries from the final user's point of view.

Additionally, and always considering that the ultimate
goal of this project is to reactivate the memory pathways of
MCI patients, it offers satisfactory results in terms of
capturing the main events of the daily life of the wearable camera
users. The public-domain code developed for our visual
summary methodology is published online³.

³ https://imatge.upc.edu/web/publications/visual-summary-egocentric-photostreams-representative-keyframes-0
Fig. 8. Results answering "yes" to the question "Can this set
of images represent the day?"

Fig. 9. Results to the question "Which of the previous
summaries does better describe the day?"